# load libraries
library(tidyverse)
library(LightLogR)
library(gt)

Tutorial on analysis pipelines for visual experience datasets
1 Abstract
This tutorial presents an analysis pipeline for visual experience datasets, with a focus on reproducible workflows tailored for use in human chronobiology and myopia research. Light exposure and its retinal encoding affect human physiology and behaviour over various time scales. Here, we provide step-by-step instructions for importing, visualising, and processing light exposure data using the open-source tool LightLogR. This includes time-series analysis for working distance, spectral characteristics, and biologically relevant light metrics. By leveraging a modular approach, this tutorial supports researchers in building flexible and robust pipelines that can accommodate diverse experimental paradigms and measurement systems.
2 Introduction
Exposure to the optical environment — often referred to as visual experience — profoundly influences human physiology and behaviour across multiple time scales. Two notable examples, though from distinct research domains, can be understood through a common, retinally-referenced conceptual framework.
The first relates to the non-visual effects of light on human circadian and neuroendocrine physiology. The light–dark cycle entrains the circadian clock, and light exposure during the night suppresses melatonin production (Brown et al. 2022; Blume, Garbazza, and Spitschan 2019).
The second concerns the influence of visual experience on ocular development, particularly myopia. Time spent outdoors — characterised by distinct optical environments — has been consistently associated with protective effects on ocular growth and health outcomes (Dahlmann-Noor et al. 2025).
In controlled laboratory settings, light exposure can be held constant or manipulated parametrically. However, such exposures rarely replicate real-world conditions, which are inherently complex and dynamic. As people move in and between spaces (indoors and outdoors) and move their trunks, heads and eyes, the exposure to the optical environment varies significantly (Webler et al. 2019), and is modulated by behaviour (Biller, Balakrishnan, and Spitschan 2024). Wearable devices for measuring light exposure have thus emerged as vital tools in capturing the richness of ecological visual experience. These tools generate high-dimensional datasets that demand rigorous and flexible analysis strategies.
Starting in the 1980s (Okudaira, Kripke, and Webster 1983), technology to measure exposure to the optical environment has been developed and matured, with miniaturized illuminance sensors now (2025) being very common in consumer smartwatches. In research, several devices and device types are available, which differ in their functionality, ranging from small pin-like devices measuring light exposure (Mohamed et al. 2021) to head-mounted multi-modal measurement devices capturing almost all relevant aspects of visual experience (Gibaldi et al. 2024). With the increased technical capabilities of wearables come considerably more complex and dense datasets. These go hand in hand with an overwhelming number of metrics, as revealed by review papers in both fields.
At present, the analysis processes to derive metrics are often implemented on a by-workgroup, or even by-researcher basis, which is both a potential source of errors and inconsistencies between publications, and also a considerable time sink for researchers (Hartmeyer, Webler, and Andersen 2022). Too often, more time is spent preparing the data than actually gaining insights through rigorous statistical testing and exploration. The preparation tasks are best handled, or at least facilitated, by standardized, transparent, and community-based analysis pipelines (Zauner and Udovicic 2024).
In circadian research, the package LightLogR for R statistical software was developed (J. Zauner, Hartmeyer, and Spitschan 2025). LightLogR is an open-source, MIT-licensed, and community-driven package specifically made to work with data from wearable light loggers and optical radiation dosimeters. It also contains functions to calculate over sixty different metrics used across the field (Hartmeyer and Andersen 2023). In a recent update, the package was significantly expanded to deal with modalities beyond illuminance, such as distance or even light spectra, which are highly relevant for myopia research (Hönekopp and Weigelt 2023).
In this article we show that the analysis pipelines and metric functions in LightLogR naturally apply to the whole field of visual experience, not just circadian research and chronobiology. Our approach is modular and extensible, allowing researchers to adapt it to a variety of devices and research questions. Emphasis is placed on clarity, transparency, and reproducibility, aligning with best practices in scientific computing and open science. We use data from two devices to demonstrate the LightLogR workflow and output with metrics relevant to myopia research, covering working distance, daylight exposure, and spectral analyses. We recommend recreating the analysis from this script. All necessary data and code are provided under an open license in the GitHub repository.
3 Methods and materials
3.1 Software
This tutorial is built with Quarto, an open-source scientific and technical publishing system, integrating text, code, and code-output into a single document. The source-code to reproduce the outcomes is part of the document and accessible via the code-tools menu.
Package LightLogR (Version 0.9.0 “Sunrise”) was used with R statistical software (Version 4.4.3 “Trophy Case”). We further used the tidyverse package (Version 2.0.0) for principled data analysis, which LightLogR follows. Finally, the gt package (Version 1.0.0) was used for table generation. A comprehensive overview of the R computing environment can be found in the session info.
3.2 Metric selection and definitions
In March of 2025, two workshops with researchers in the field of myopia, initiated by the Research Data Alliance (RDA) Working Group on Optical Radiation Exposure and Visual Experience Data focused on the current needs and future opportunities regarding data analysis, including metrics. Out of the expert inputs in these workshops, a list of visual experience metrics was collected, which is shown in Table 1. These include currently used metrics and definitions (Wen et al. 2020, 2019; Bhandari and Ostrin 2020; Williams et al. 2019), but also new metrics that are possible through spectrally-resolved measurements.
| No. | Name | Implementation1 |
|---|---|---|
| Distance | | |
| 1 | Total wear time daily | durations() |
| 2 | Duration per Distance range | filter for distance range + durations() (for single ranges) or grouping by distance range + durations() (for all ranges) |
| 3 | Frequency of Continuous near work | extract_clusters() |
| 4 | Frequency, duration, and distances of Near work episodes | extract_clusters() + extract_metric() |
| 5 | Frequency and duration of Visual breaks | extract_clusters() + filter |
| Light | | |
| 6 | Light exposure (in lux) | summarize_numeric() |
| 7 | Duration per Outdoor range | grouping by Outdoor range + durations() |
| 8 | The number of times light level changes from indoor (<1000 lx) to outdoor (>1000 lx) | extract_states() + filter |
| 9 | Longest period above 1000 lx | period_above_threshold() |
| Spectrum | | |
| 10 | Ratio of short vs. long wavelength light | |
| 11 | Short-wavelength light at certain times of day | filter_Time() (for defined times) or grouping by time state |
Table 2 contains definitions for the terms in Table 1. Note that these definitions may vary depending on the research question or device capabilities.
| Metric | Description / pseudo formula |
|---|---|
| Total wear time | \(\sum(t)*dt, \textrm{ where } t\textrm{: valid observations }\) |
| Mean daily | \(\frac{5 \cdot \overline{\textrm{weekday}} + 2 \cdot \overline{\textrm{weekend}}}{7}\) |
| Near work | \(\textrm{working distance}, [10,60)cm\) |
| Intermediate Work | \(\textrm{working distance}, [60,100)cm\) |
| Total work | \(\textrm{working distance}, [10,120)cm\) |
| Distance range | \(\textrm{working distance}, {[10,20)cm \textrm{, Extremely near} \\ [20,30)cm \textrm{, Very near} \\ [30,40)cm \textrm{, Fairly near} \\ [40,50)cm \textrm{, Near} \\ [50,60)cm \textrm{, Moderately near} \\ [60,70)cm \textrm{, Near intermediate} \\ [70,80)cm \textrm{, Intermediate} \\ [80,90)cm \textrm{, Moderately intermediate} \\ [90,100)cm \textrm{, Far intermediate}}\) |
| Continuous near work | \(\textrm{working distance}, [20,60)cm,\) \(T_\textrm{duration} \geq 30 \textrm{ minutes}, \ T_\textrm{interruptions} \leq 1 \textrm{ minute}\) |
| Near work episodes | \(\textrm{working distance}, [20,60)cm,\) \(T_\textrm{interruptions} \leq 20 \textrm{ seconds}\) |
| Ratio of daily near work | \(\frac{T_\textrm{near work}}{T_\textrm{total wear}}\) |
| Visual break | \(\textrm{working distance} \geq 100cm, \\ T_\textrm{duration} \geq 20 \textrm{ seconds}, \ T_\textrm{previous episode} \leq 20 \textrm{ minutes}\) |
| Outdoor range | \(\textrm{illuminance}, {[1000,2000)lx \textrm{, Outdoor bright} \\ [2000,3000)lx \textrm{, Outdoor very bright} \\ [3000, \infty) lx \textrm{, Outdoor extremely bright}}\) |
| Light exposure2 | \(\overline{\textrm{illuminance}}\) |
| Spectral bands | \(\textrm{spectral irradiance}, {[380,500]nm \textrm{, short wavelength light} \\ [600, 780]nm \textrm{, long wavelength light}}\) |
| Ratio of short vs. long wavelength light | \(\frac{E_{e\textrm{,short wavelength}}}{E_{e\textrm{,long wavelength}}}\) |
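The "Mean daily" weighting in Table 2 can be sketched in a few lines of base R. This is a toy example with made-up daily near-work durations, not values from the tutorial's datasets; in the pipeline itself, LightLogR's mean_daily() performs this weighting on grouped summaries.

```r
# Toy sketch of the "Mean daily" weighting from Table 2 (base R).
# The daily near-work durations (in hours) are made-up example values.
daily <- data.frame(
  day      = as.Date("2021-02-08") + 0:6,          # a Monday through Sunday
  duration = c(6.5, 7.0, 7.5, 6.0, 7.0, 3.5, 4.0)  # hypothetical hours per day
)
daily$weekend <- format(daily$day, "%u") %in% c("6", "7") # Saturday/Sunday

weekday_mean <- mean(daily$duration[!daily$weekend])
weekend_mean <- mean(daily$duration[daily$weekend])

# weighted combination: (5 * weekday mean + 2 * weekend mean) / 7
(5 * weekday_mean + 2 * weekend_mean) / 7
```

The weighting ensures that a recording week with unequal numbers of weekdays and weekend days still yields a representative average day.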
3.3 Devices
Data from two devices will be used for analysis:

- Distance and light metrics will be calculated based on the export from the Clouclip device (Glasson Technology Co., Ltd, Hangzhou, China; Wen et al. 2021, 2020). This device has a simple output of only Distance and Illuminance measurements. Data were recorded in 5-second intervals. A week's worth of data takes up about 1.6 MB of storage.
- Spectrum metrics will be calculated using data from a multi-modal device, the Visual Environment Evaluation Tool or VEET (Meta Platforms, Inc., Menlo Park, California, USA; Sah, Narra, and Ostrin 2025). This dense dataset contains distance (spatially resolved), light, activity (accelerometer & gyroscope), and spectrum measurements, recorded in 2-second intervals. A week's worth of data takes up about 270 MB of storage.
3.4 Data import & preparation
This tutorial will start by importing a Clouclip dataset and providing an overview of the data. The Clouclip export is considerably simpler compared to the VEET device, only containing Distance and Illuminance measurements. The VEET dataset will be imported later for the spectrum related metrics.
LightLogR provides accessible import functionality for many wearable devices (18 at the time of writing). The required inputs are the file(s) and the time zone the device was set up with/recorded in (the default is UTC). Many optional arguments let the user, e.g., extract IDs from the file name or correct for daylight saving time jumps. The import also provides a comprehensive overview of the data, informing the user of any gaps and irregularities.
# import the data
path <- "data/Sample_Clouclip.csv"
tz <- "US/Central"
dataCC <- import$Clouclip(path, tz = tz, manual.id = "Clouclip")
Successfully read in 58'081 observations across 1 Ids from 1 Clouclip-file(s).
Timezone set is US/Central.
The system timezone is Europe/Berlin. Please correct if necessary!
First Observation: 2021-02-06 17:12:47
Last Observation: 2021-02-14 17:12:36
Timespan: 8 days
Observation intervals:
Id interval.time n pct
1 Clouclip 5s 54572 93.9601%
2 Clouclip 17s 12 0.0207%
3 Clouclip 18s 14 0.0241%
4 Clouclip 120s (~2 minutes) 3479 5.9900%
5 Clouclip 128s (~2.13 minutes) 1 0.0017%
6 Clouclip 132s (~2.2 minutes) 1 0.0017%
7 Clouclip 133s (~2.22 minutes) 1 0.0017%
3.4.1 Exploration
It seems there are many gaps in the data. Understanding how these relate to the measurements and the time of recording is essential. LightLogR provides tools to visualize and summarize these gaps.
In the presence of irregular data, i.e., data that do not fall in a regular sequence of datetimes, gap summaries can be computationally very expensive and inaccurate. For that reason, it makes sense to check for irregular data (if it is not already visible from the import summary).
dataCC |> has_irregulars() #test for irregulars

[1] TRUE
In the case of irregular data, it is recommended to visualize irregulars without recalculating implicit gaps, which are missing observations at regular intervals. See Figure 1.
y.label <- "Distance (cm)"
dataCC |> gg_gaps(Dis,
include.implicit.gaps = FALSE,
show.irregulars = TRUE,
y.axis.label = y.label,
group.by.days = TRUE
)

It looks like the data on every day but the first and last are considered irregular. This happens with some devices and requires manual handling. Strategies include:
1. Removing some intervals from the start if the irregularities are due to the setup process. See filter_Date() for a way to remove these. This is usually a good solution if only the first day has irregular data and the rest is regular.
2. Rounding datetime values to the closest (5-second) interval. See cut_Datetime() for a helper function. This is appropriate if deviations from the dominant interval (5 seconds in this case) are infrequent, all deviations are larger than the dominant interval, and rounded datetimes don't lead to duplicated datetimes.
3. Aggregating data into a coarser recording interval. See aggregate_Datetime() for this option. This is appropriate in most cases, but leads to a loss of granularity.
Based on the import summary and the graph, we use the second option to deal with the irregular data.
# round observation times to the next 5-second interval
dataCC <-
dataCC |>
cut_Datetime("5 secs", New.colname = Datetime) |>
  group_by(Day = date(Datetime))

# summarize the data
dataCC |> gap_table(Dis, Variable.label = "Distance (cm)")

Summary of available and missing data. Variable: Distance (cm). Column groups: Data — Regular (Time, %, n1), Irregular (n2,1), Range (Time, n1), Interval, Gaps (N, ø, øn1); Missing — overall (Time, %, n1), Implicit (Time, %, n1), Explicit (Time, %, n1).

| Day | Time | % | n1 | n2,1 | Time | n1 | Interval | N | ø | øn1 | Time | % | n1 | Time | % | n1 | Time | % | n1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Overall | 2d 14h 43m 10s | 29.0%3 | 45,158 | 0 | 1w 2d | 155,520 | 5 | 2,690 | 1h 35m 57s | 1,151 | 6d 9h 16m 50s | 71.0%3 | 110,362 | 5d 15h 19m 55s | 62.7%3 | 97,439 | 17h 56m 55s | 8.3%3 | 12,923 |
| 2021-02-06 | 43m 30s | 3.0% | 522 | 0 | 1d | 17,280 | 5s | 26 | 53m 43s | 645 | 23h 16m 30s | 97.0% | 16,758 | 22h 53m 20s | 95.4% | 16,480 | 23m 10s | 1.6% | 278 |
| 2021-02-07 | 2h 45m | 11.5% | 1,980 | 0 | 1d | 17,280 | 5s | 139 | 9m 10s | 110 | 21h 15m | 88.5% | 15,300 | 19h 42m 35s | 82.1% | 14,191 | 1h 32m 25s | 6.4% | 1,109 |
| 2021-02-08 | 11h 13m 55s | 46.8% | 8,087 | 0 | 1d | 17,280 | 5s | 443 | 1m 44s | 21 | 12h 46m 5s | 53.2% | 9,193 | 10h 47m 50s | 45.0% | 7,774 | 1h 58m 15s | 8.2% | 1,419 |
| 2021-02-09 | 8h 46m 25s | 36.6% | 6,317 | 0 | 1d | 17,280 | 5s | 278 | 3m 17s | 39 | 15h 13m 35s | 63.4% | 10,963 | 13h 50m | 57.6% | 9,960 | 1h 23m 35s | 5.8% | 1,003 |
| 2021-02-10 | 7h 1m 30s | 29.3% | 5,058 | 0 | 1d | 17,280 | 5s | 367 | 2m 47s | 33 | 16h 58m 30s | 70.7% | 12,222 | 14h 18m 40s | 59.6% | 10,304 | 2h 39m 50s | 11.1% | 1,918 |
| 2021-02-11 | 8h 31m 55s | 35.5% | 6,143 | 0 | 1d | 17,280 | 5s | 423 | 2m 12s | 26 | 15h 28m 5s | 64.5% | 11,137 | 11h 43m 30s | 48.9% | 8,442 | 3h 44m 35s | 15.6% | 2,695 |
| 2021-02-12 | 12h 17m 55s | 51.2% | 8,855 | 0 | 1d | 17,280 | 5s | 417 | 1m 41s | 20 | 11h 42m 5s | 48.8% | 8,425 | 9h 17m 45s | 38.7% | 6,693 | 2h 24m 20s | 10.0% | 1,732 |
| 2021-02-13 | 10h 32m 15s | 43.9% | 7,587 | 0 | 1d | 17,280 | 5s | 527 | 1m 32s | 18 | 13h 27m 45s | 56.1% | 9,693 | 10h 42m 10s | 44.6% | 7,706 | 2h 45m 35s | 11.5% | 1,987 |
| 2021-02-14 | 50m 45s | 3.5% | 609 | 0 | 1d | 17,280 | 5s | 70 | 19m 51s | 238 | 23h 9m 15s | 96.5% | 16,671 | 22h 4m 5s | 92.0% | 15,889 | 1h 5m 10s | 4.5% | 782 |
| 1 Number of (missing or actual) observations | |||||||||||||||||||
| 2 If n > 0: it is possible that the other summary statistics are affected, as they are calculated based on the most prominent interval. | |||||||||||||||||||
| 3 Based on times, not necessarily number of observations | |||||||||||||||||||
Table 3 shows that there are no more irregular data after treatment. There are, however, considerable implicitly missing data, which can be converted to explicitly missing data with gap_handler(); this makes calculations based on the dataset much more robust. Furthermore, two days have less than an hour's worth of data. These will be removed.
#make implicit gaps explicit
dataCC <-
dataCC |>
gap_handler(full.days = TRUE) |> #make gaps explicit
  remove_partial_data(Dis, threshold.missing = "23 hours")

The Clouclip device uses sentinel values to encode states in the measurement values. LightLogR converts these to a dedicated column, and we can visualize them alongside the photoperiod.
#setting coordinates for Houston, Texas
coordinates <- c(29.75, -95.36)
# visualize observations
dataCC |>
gg_day(y.axis = Dis, geom = "line", y.axis.label = y.label) |> #create a basic plot
gg_state(Dis_status, aes_fill = Dis_status) |> #add the status times
gg_photoperiod(coordinates) + #add the photoperiod (day/night)
  theme(legend.position = "bottom")

With these data, the metrics can be calculated.
4 Results
4.1 Distance
In the following sections, daily values are calculated. The helper function below takes these daily values and calculates averages for weekday, weekend, and mean daily:
to_mean_daily <- function(data, prefix = "average_") {
data |>
ungroup(Day) |> #ungroup by days
mean_daily(prefix = prefix) |> #calculate the averages
    rename_with(.fn = \(x) str_replace_all(x,"_"," ")) |> #replace underscores with spaces
    gt() #table output
}

4.1.1 Total wear time daily
For Total wear time daily, only instances with actual distance data available are taken into account (Table 4).
dataCC |>
durations(Dis) |> #calculate the durations per group (day)
  to_mean_daily("Total wear ")

| Day | Total wear duration |
|---|---|
| Mean daily | 31448s (~8.74 hours) |
| Weekday | 34460s (~9.57 hours) |
| Weekend | 23918s (~6.64 hours) |
4.1.2 Duration within distance ranges
This metric can be calculated in two ways. Table 5 shows the duration of near work, whereas Table 6 shows the duration of distance ranges.
dataCC |>
filter(Dis >= 10, Dis < 60) |>
durations(Dis) |>
  to_mean_daily("Near work ")

| Day | Near work duration |
|---|---|
| Mean daily | 22586s (~6.27 hours) |
| Weekday | 26343s (~7.32 hours) |
| Weekend | 13192s (~3.66 hours) |
#cutting distance into bands
dist_breaks <- c(10, 20, 30, 40, 50, 60, 70, 80, 90, 100, Inf)
dist_labels <- c(
"Extremely near", # [10, 20)
"Very near", # [20, 30)
"Fairly near", # [30, 40)
"Near", # [40, 50)
"Moderately near", # [50, 60)
"Near Intermediate", # [60, 70)
"Intermediate", # [70, 80)
"Moderately intermediate", # [80, 90)
"Far intermediate", # [90, 100)
"Far" # [100, Inf)
)
dataCC |>
mutate(Dis_range =
cut(Dis, breaks = dist_breaks, labels = dist_labels) #create ranges
) |>
drop_na(Dis_range) |> #remove NAs
group_by(Dis_range, .add = TRUE) |> #group by ranges
durations(Dis) |> #calculate durations
pivot_wider(names_from = Dis_range, values_from = duration) |> #widen data
to_mean_daily("") |>
  fmt_duration(input_units = "seconds", output_units = "minutes") #show minutes

| Day | Extremely near | Very near | Fairly near | Near | Moderately near | Near Intermediate | Intermediate | Moderately intermediate | Far intermediate | Far |
|---|---|---|---|---|---|---|---|---|---|---|
| Mean daily | 169m | 102m | 46m | 27m | 13m | 7m | 4m | 5m | 11m | 16m |
| Weekday | 180m | 128m | 60m | 36m | 16m | 7m | 6m | 6m | 14m | 20m |
| Weekend | 141m | 38m | 12m | 5m | 5m | 8m | 1m | 3m | 2m | 5m |
Figure 3 shows the distribution of relative times within each distance range.
4.1.3 Frequency of Continuous near work
Continuous near work has more than one condition. Beyond a distance range, it requires a certain length, but also allows for interruptions. This is what extract_clusters() allows.
Table 7 summarizes the results, and Figure 4 visualizes them.
dataCC |>
extract_clusters(Dis >= 20 & Dis < 60, #define the condition
cluster.duration = "30 mins", #define the minimum duration
interruption.duration = "1 min") |> #define max interruption
summarize_numeric(remove = c("start", "end", "epoch", "duration"),
add.total.duration = FALSE) |> #count the number of episodes
mean_daily(prefix = "Frequency of ") |># daily means
  gt() |> fmt_number() #table

| Day | Frequency of episodes |
|---|---|
| Mean daily | 0.86 |
| Weekday | 1.20 |
| Weekend | 0.00 |
Warning: Removed 65357 rows containing missing values or values outside the scale range
(`geom_line()`).
4.1.4 Near Work episodes
This section of the metrics consists of three aspects: Frequency, Duration, and Distances. The first two aspects are collected the same way as in the previous section, whereas the Distance aspect is extracted from the base data. All are summarized in Table 8.
dataCC |>
extract_clusters(Dis >= 20 & Dis < 60, #define the condition
cluster.duration = "5 secs", #define the minimum duration
interruption.duration = "20 secs") |> #define max interruption
extract_metric(dataCC, distance = mean(Dis, na.rm = TRUE)) |>
summarize_numeric(remove = c("start", "end", "epoch"), prefix = "",
add.total.duration = FALSE) |> #count the number of episodes
mean_daily(prefix = "") |> #daily means
  gt() |> fmt_number(c(distance, episodes), decimals = 0) #table

| Day | duration | distance | episodes |
|---|---|---|---|
| Mean daily | 233s (~3.88 minutes) | 32 | 57 |
| Weekday | 284s (~4.73 minutes) | 32 | 64 |
| Weekend | 104s (~1.73 minutes) | 32 | 40 |
4.1.5 Visual breaks
Visual breaks differ slightly from the previous metrics: here, both the minimum break duration and the preceding episode are important. This leads to a two-step process, where we first extract instances of Distance above 100 cm lasting at least 20 seconds, before filtering for a previous episode duration of at most 20 minutes. Table 9 provides the daily frequency of visual breaks.
dataCC |>
  extract_clusters(Dis >= 100, #define the condition (distance of at least 100 cm)
cluster.duration = "20 secs", #define the minimum duration
return.only.clusters = FALSE) |> #return non-clusters as well
  filter(lag(duration) <= duration("20 mins"), is.cluster) |> # return only
  #clusters with previous episode lengths of maximum 20 minutes
summarize_numeric(remove = c("start", "end", "epoch", "is.cluster", "duration"),
prefix = "",
add.total.duration = FALSE) |> #count the number of episodes
mean_daily(prefix = "Daily ") |> #daily means
  gt() |> fmt_number(decimals = 0) #table

| Day | Daily episodes |
|---|---|
| Mean daily | 170 |
| Weekday | 183 |
| Weekend | 136 |
4.2 Light
Illuminance values are very low in the example dataset from the Clouclip device and would not yield satisfying summaries in the Light section. Thus, we next import data from the VEET device. Because different modalities are stored in the data, we need to specify which modality we want to access. ALS is the acronym for Ambient Light Sensor.
path <- "data/01_VEET_L.csv"
tz <- "US/Central"
dataVEET <- import$VEET(path, tz = tz, modality = "ALS", manual.id = "VEET")

Warning: There was 1 warning in `dplyr::mutate()`.
ℹ In argument: `dplyr::across(...)`.
Caused by warning:
! NAs introduced by coercion
Successfully read in 304'193 observations across 1 Ids from 1 VEET-file(s).
Timezone set is US/Central.
The system timezone is Europe/Berlin. Please correct if necessary!
1 observations were dropped due to a missing or non-parseable Datetime value (e.g., non-valid timestamps during DST jumps).
First Observation: 2024-06-04 15:00:37
Last Observation: 2024-06-12 08:29:43
Timespan: 7.7 days
Observation intervals:
Id interval.time n pct
1 VEET 0s 1 0.00033%
2 VEET 1s 1957 0.64334%
3 VEET 2s 300147 98.67025%
4 VEET 3s 2074 0.68181%
5 VEET 4s 3 0.00099%
6 VEET 9s 5 0.00164%
7 VEET 10s 3 0.00099%
8 VEET 109s (~1.82 minutes) 1 0.00033%
9 VEET 59077s (~16.41 hours) 1 0.00033%
This dataset has gaps and irregular data, similar to the Clouclip data. For consistency, we will aggregate the data to 5-second intervals and set gaps explicitly. We will also remove days with more than one hour of missing data. Six days with good data coverage remain, as seen in Table 10.
dataVEET <-
dataVEET |>
aggregate_Datetime(unit = "5 seconds") |> #aggregate to 5 second interval
gap_handler(full.days = TRUE) |> #set implicit gaps to explicit gaps
group_by(Day = date(Datetime)) |> #group data by day
remove_partial_data(Lux, threshold.missing = "1 hour") #remove bad days
dataVEET |> gap_table(Lux, "Illuminance (lx)")

Summary of available and missing data. Variable: Illuminance (lx). Column groups: Data — Regular (Time, %, n1), Irregular (n2,1), Range (Time, n1), Interval, Gaps (N, ø, øn1); Missing — overall (Time, %, n1), Implicit (Time, %, n1), Explicit (Time, %, n1).

| Day | Time | % | n1 | n2,1 | Time | n1 | Interval | N | ø | øn1 | Time | % | n1 | Time | % | n1 | Time | % | n1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Overall | 5d 23h 57m 40s | 100.0%3 | 103,652 | 0 | 6d | 103,680 | 5 | 8 | 58s | 12 | 2m 20s | 0.0%3 | 28 | 0s | 0.0%3 | 0 | 2m 20s | 0.0%3 | 28 |
| 2024-06-06 | 23h 58m 5s | 99.9% | 17,257 | 0 | 1d | 17,280 | 5s | 3 | 38s | 8 | 1m 55s | 0.1% | 23 | 0s | 0.0% | 0 | 1m 55s | 0.1% | 23 |
| 2024-06-07 | 1d | 100.0% | 17,280 | 0 | 1d | 17,280 | 5s | 0 | 0s | 0 | 0s | 0.0% | 0 | 0s | 0.0% | 0 | 0s | 0.0% | 0 |
| 2024-06-08 | 23h 59m 55s | 100.0% | 17,279 | 0 | 1d | 17,280 | 5s | 1 | 5s | 1 | 5s | 0.0% | 1 | 0s | 0.0% | 0 | 5s | 0.0% | 1 |
| 2024-06-09 | 23h 59m 50s | 100.0% | 17,278 | 0 | 1d | 17,280 | 5s | 2 | 5s | 1 | 10s | 0.0% | 2 | 0s | 0.0% | 0 | 10s | 0.0% | 2 |
| 2024-06-10 | 23h 59m 55s | 100.0% | 17,279 | 0 | 1d | 17,280 | 5s | 1 | 5s | 1 | 5s | 0.0% | 1 | 0s | 0.0% | 0 | 5s | 0.0% | 1 |
| 2024-06-11 | 23h 59m 55s | 100.0% | 17,279 | 0 | 1d | 17,280 | 5s | 1 | 5s | 1 | 5s | 0.0% | 1 | 0s | 0.0% | 0 | 5s | 0.0% | 1 |
| 1 Number of (missing or actual) observations | |||||||||||||||||||
| 2 If n > 0: it is possible that the other summary statistics are affected, as they are calculated based on the most prominent interval. | |||||||||||||||||||
| 3 Based on times, not necessarily number of observations | |||||||||||||||||||
4.2.1 Light exposure
Averages of light exposure can be calculated with just summarize_numeric(). See Table 11.
dataVEET |>
select(Day, Datetime, Lux) |>
summarize_numeric(prefix = "mean ", remove = c("Datetime")) |>
mean_daily(prefix = "") |>
  gt() |> fmt_number(decimals = 1) |> cols_hide(episodes) #table

| Day | mean Lux |
|---|---|
| Mean daily | 304.1 |
| Weekday | 357.8 |
| Weekend | 169.8 |
As light exposure data is highly skewed and zero-inflated, however, a transformation is sensible for the mean to be meaningful. The resulting illuminance is commonly much lower, due to the skew and the influence of zero values, as can be seen in Table 12. log_zero_inflated() solves this by adding a small value to the dataset prior to logarithmic transformation. exp_zero_inflated() does the opposite.
dataVEET |>
select(Day, Datetime, Lux) |>
mutate(Lux = Lux |> log_zero_inflated()) |> #convert to logarithmic data
summarize_numeric(prefix = "mean ", remove = c("Datetime")) |>
mean_daily(prefix = "") |>
mutate(`mean Lux` = `mean Lux` |> exp_zero_inflated()) |>
  gt() |> fmt_number(decimals = 1) |> cols_hide(episodes) #table

| Day | mean Lux |
|---|---|
| Mean daily | 6.3 |
| Weekday | 7.9 |
| Weekend | 3.5 |
4.2.2 Duration per outdoor range
In the same way as the distance ranges, illuminance ranges are summarized, as displayed in Table 13.
#cutting illuminance into bands
out_breaks <- c(1:3*10^3, Inf)
out_labels <- c(
"Outdoor bright", # [1000, 2000)
"Outdoor very bright", # [2000, 3000)
"Outdoor extremely bright" # [3000, Inf)
)
dataVEET <-
dataVEET |>
mutate(Lux_range =
cut(Lux, breaks = out_breaks, labels = out_labels) #create ranges
)
dataVEET |>
drop_na(Lux_range) |> #remove NAs
group_by(Lux_range, .add = TRUE) |> #group by ranges
durations(Lux) |> #calculate durations
pivot_wider(names_from = Lux_range, values_from = duration) |> #widen data
to_mean_daily("") |>
  fmt_duration(input_units = "seconds", output_units = "minutes") #show minutes

| Day | Outdoor bright | Outdoor very bright | Outdoor extremely bright |
|---|---|---|---|
| Mean daily | 24m | 32m | 55m |
| Weekday | 29m | 41m | 65m |
| Weekend | 10m | 10m | 30m |
These states can also be easily visualized in Figure 5.
dataVEET |>
gg_day(y.axis = Lux,
y.axis.label = "Illuminance (lx)",
geom = "line",
jco_color = FALSE) |>
gg_state(Lux_range, aes_fill = Lux_range, alpha = 0.75) |>
gg_photoperiod(coordinates) +
scale_fill_viridis_d() +
labs(fill = "Illuminance conditions") +
  theme(legend.position = "bottom")

4.2.3 Changes indoor to outdoor
To calculate the number of times a change from indoor to outdoor happens, we can extract all instances of such state changes (Table 14).
dataVEET |>
extract_states(Outdoor, Lux >= 1000, #get all instances of states and non-states
group.by.state = FALSE) |> #don't group output by the state
  filter(!lag(Outdoor), Outdoor) |> #keep outdoor states where the prior state was indoor
summarize_numeric(
prefix = "mean ",
remove = c("Datetime", "Outdoor", "start", "end", "duration"),
add.total.duration = FALSE
) |>
mean_daily(prefix = "") |>
  gt() |> fmt_number(episodes, decimals = 0) #table

| Day | mean epoch | episodes |
|---|---|---|
| Mean daily | 5s | 64 |
| Weekday | 5s | 72 |
| Weekend | 5s | 46 |
This seems rather high and is certainly influenced by the small interval of 5 seconds. Requiring that the time outside persists for at least 5 minutes (while allowing slight interruptions) brings this number down. See Table 15 for comparison.
dataVEET |>
extract_clusters(Lux >= 1000, #cluster conditions
  cluster.duration = "5 min", #require 5-minute durations
interruption.duration = "20 secs", #allow for short interruptions
return.only.clusters = FALSE) |> #get all instances of clusters and non-clusters
  filter(!lag(is.cluster), is.cluster) |> #keep clusters where the prior state is not a cluster
summarize_numeric(
prefix = "mean ",
remove = c("Datetime", "start", "end", "duration"),
add.total.duration = FALSE
) |>
mean_daily(prefix = "") |>
  gt() |> fmt_number(episodes, decimals = 0) #table

| Day | mean epoch | episodes |
|---|---|---|
| Mean daily | 5s | 5 |
| Weekday | 5s | 6 |
| Weekend | 5s | 4 |
4.2.4 Longest period above 1000 lx
The last Light aspect from Table 1 is the longest period above 1000 lx (PAT1000). While this can be calculated based on what we have shown above, by combining extract_states() with a simple filter for maximal duration, this metric provides a good opportunity to show that some aspects can also be calculated with dedicated metric functions in LightLogR. In this case, we use period_above_threshold(). The benefit of this approach is that multiple metrics can be calculated at once. Here, for example, we also calculate the duration above 1000 lx (TAT1000) alongside it in Table 16.
dataVEET |>
summarize(PAT1000 =
period_above_threshold(Lux,
Datetime,
threshold = 1000,
na.rm = TRUE),
TAT1000 =
duration_above_threshold(Lux,
Datetime,
threshold = 1000,
na.rm = TRUE),
.groups = "drop_last") |>
mean_daily(prefix = "") |>
  gt()

| Day | PAT1000 | TAT1000 |
|---|---|---|
| Mean daily | 1987s (~33.12 minutes) | 6709s (~1.86 hours) |
| Weekday | 2501s (~41.68 minutes) | 8164s (~2.27 hours) |
| Weekend | 702s (~11.7 minutes) | 3070s (~51.17 minutes) |
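For readers who prefer the state-based route mentioned above, the same longest-episode value can in principle be derived with extract_states() and a filter for the maximal duration. The following is a minimal sketch, not a tested equivalent: it assumes extract_states() returns one row per episode with start, end, and duration columns, and the helper column above1000 is introduced here purely for illustration.

```r
# Sketch: longest episode above 1000 lx via state extraction
# (assumes extract_states() yields one row per episode with a `duration` column)
dataVEET |>
  mutate(above1000 = Lux >= 1000) |>     #classify each observation
  extract_states(above1000) |>           #collapse observations into episodes
  filter(above1000) |>                   #keep only above-threshold episodes
  filter(duration == max(duration)) |>   #longest episode within each group
  select(above1000, start, end, duration)
```

This yields one row per group (here, per Day), which could then be fed into mean_daily() as in the pipelines above.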
4.3 Spectrum
Spectral data are not part of any of the datasets used in this article. Rather, they have to be reconstructed from sensor counts and a calibration matrix. The VEET device contains ten sensor channels that can be used for reconstruction. As these belong to a different sensor than the ambient light sensor, a different modality needs to be imported from the same file. Data preparation is analogous to Light. PHO contains the data from the spectral sensor channels. For computational reasons, the data will be aggregated to 5-minute intervals. The first three rows are shown in Table 17.
dataVEET <- import$VEET(path, tz = tz, modality = "PHO", manual.id = "VEET")
Successfully read in 304'197 observations across 1 Ids from 1 VEET-file(s).
Timezone set is US/Central.
The system timezone is Europe/Berlin. Please correct if necessary!
First Observation: 2024-06-04 15:00:36
Last Observation: 2024-06-12 08:29:43
Timespan: 7.7 days
Observation intervals:
Id interval.time n pct
1 VEET 0s 1 0.00033%
2 VEET 1s 1753 0.57627%
3 VEET 2s 300556 98.80340%
4 VEET 3s 1873 0.61572%
5 VEET 4s 3 0.00099%
6 VEET 6s 1 0.00033%
7 VEET 7s 2 0.00066%
8 VEET 9s 5 0.00164%
9 VEET 109s (~1.82 minutes) 1 0.00033%
10 VEET 59077s (~16.41 hours) 1 0.00033%
dataVEET <-
dataVEET |>
aggregate_Datetime(unit = "5 mins") |> #aggregate to 5 minute intervals
gap_handler(full.days = TRUE) |> #set implicit gaps to explicit gaps
group_by(Day = date(Datetime)) |> #group data by day
remove_partial_data(Gain, threshold.missing = "1 hour") #remove bad days
dataVEET |> head(3) |> gt() |> fmt_number(s415:ClearR)

| Id | Datetime | is.implicit | time_stamp | integration_time | Gain | s415 | s445 | s480 | s515 | s555 | s590 | s630 | s680 | s940 | Dark | ClearL | ClearR | file.name | modality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2024-06-06 | |||||||||||||||||||
| VEET | 2024-06-06 | FALSE | 1717649999 | 50 | 512 | 0.14 | 4.77 | 5.89 | 12.95 | 27.06 | 41.85 | 54.49 | 33.61 | 14.67 | 0.00 | 92.84 | 95.39 | 01_VEET_L | PHO |
| VEET | 2024-06-06 00:05:00 | FALSE | 1717650299 | 50 | 512 | 0.14 | 4.69 | 5.87 | 12.95 | 26.82 | 41.33 | 54.07 | 33.25 | 14.69 | 0.00 | 91.93 | 94.81 | 01_VEET_L | PHO |
| VEET | 2024-06-06 00:10:00 | FALSE | 1717650599 | 50 | 512 | 0.14 | 4.73 | 5.94 | 12.98 | 27.39 | 42.43 | 55.07 | 33.80 | 15.15 | 0.00 | 93.63 | 96.57 | 01_VEET_L | PHO |
The channels s415 through ClearR contain raw sensor counts and need to be normalized by the Gain value. Further, ClearL and ClearR need to be averaged prior to spectral reconstruction. The appropriate gain.ratio.table for the sensor TSL2585 is integrated in LightLogR, but should also be confirmed by the manufacturer.
count.columns <- c("s415", "s445", "s480", "s515", "s555", "s590", "s630",
"s680", "s940", "Dark", "ClearL", "ClearR") #column names
#normalize data
dataVEET <-
dataVEET |>
normalize_counts( #function to normalize counts
gain.columns = rep("Gain", 12), #all sensor channels share the gain value
count.columns = count.columns, #sensor channels to normalize
gain.ratio.tables$TSL2585 #gain ratio table for the TSL2585 sensor
)
#average Clear Channels
dataVEET <-
dataVEET |>
mutate(Clear.normalized = (ClearL+ClearR)/2)
#remove raw sensor counts and rename normalized values
dataVEET <-
dataVEET |>
select(-c(s415:ClearR)) |>
rename_with(\(x) str_remove(x, ".normalized"))

This closes the necessary preparation of the dataset. The calibration matrix was provided by the manufacturer and is specific to the make and model. It should not be used for research purposes without confirming its accuracy with the manufacturer.
#import calibration matrix
calib_mtx <-
read_csv("data/VEET_calibration_matrix.csv", show_col_types = FALSE) |>
column_to_rownames("wavelength") |>
as.matrix()

Constructing the spectrum is now straightforward.
dataVEET <-
dataVEET |>
mutate(Spectrum =
spectral_reconstruction(
sensor_channels = pick(s415:s940, Clear),
calibration_matrix = calib_mtx
)
)

The dataset now contains a list-column holding the spectrum for each observation. We can visualize the data in Figure 6.
dataVEET |>
unnest(Spectrum) |> #unnest the list column
group_by(Datetime) |> #group by each spectrum
mutate(irradiance = irradiance/max(irradiance)) |> #scale spectra relative
ggplot(aes(x=wavelength, y = irradiance, group = Datetime)) + #plot
geom_path(alpha = 0.15) +
theme_minimal() +
labs(y = "Relative spectral irradiance (%)", x = "Wavelength (nm)") +
scale_y_continuous(labels = scales::label_percent())+
coord_cartesian(xlim = c(400, 700), ylim = c(0,1), expand = FALSE) +
theme(plot.margin = margin(10,20,10,10))

These spectral data will be the basis for calculating the last two metrics.
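Before moving on to the metrics, it can be useful to inspect a single reconstructed spectrum directly. This is a small sketch under the assumption (consistent with the unnesting used for the plot) that each element of the Spectrum list-column is a data frame with wavelength and irradiance columns:

```r
# Sketch: peek at one reconstructed spectrum
# (each Spectrum element is assumed to be a wavelength/irradiance data frame)
dataVEET$Spectrum[[1]] |> head()

# or in tidy form, one observation at a time:
dataVEET |>
  slice(1) |>          #first observation per group
  unnest(Spectrum) |>  #expand the list-column
  head()
```

Spot-checking individual spectra this way helps catch reconstruction artifacts (e.g., negative irradiance values) before they propagate into summary metrics.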
4.3.1 Ratio of short vs. long wavelength light
The first spectral metric requires integration across two sections of the spectrum. spectral_integration() makes this task straightforward. The results can be seen in Table 18.
dataVEET <-
dataVEET |>
select(Day, Datetime, Spectrum) |>
mutate(
short = Spectrum |> map_dbl(spectral_integration, #short wavelength
wavelength.range = c(400,500)),
long = Spectrum |> map_dbl(spectral_integration, #long wavelength
wavelength.range = c(600,700)),
`sl ratio` = short / long # calculate the ratio
)
dataVEET |>
summarize_numeric(prefix = "", remove = c("Datetime", "Spectrum")) |>
mean_daily(prefix = "") |>
gt() |> fmt_number(-`sl ratio`, decimals = 0) |> cols_hide(episodes) # table

| Day | short | long | sl ratio |
|---|---|---|---|
| Mean daily | 37 | 102 | -0.7410509 |
| Weekday | 70 | 135 | -0.5457927 |
| Weekend | −46 | 20 | -1.2291965 |
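To build intuition for the integration step, one can apply spectral_integration() to a synthetic flat spectrum, where the result is easy to reason about: a unit-irradiance spectrum integrated over a 100 nm band should come out near 100. This is a sketch under the assumption, consistent with the map_dbl() calls above, that the function accepts a wavelength/irradiance data frame as its first argument:

```r
# Toy check (assumption: spectral_integration() takes a data frame with
# `wavelength` and `irradiance` columns, as used via map_dbl() above)
flat <- tibble(wavelength = 380:780, irradiance = 1)
spectral_integration(flat, wavelength.range = c(400, 500)) #should be ~100
```

Such sanity checks are cheap insurance before computing ratios on real, noisier spectra.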
4.3.2 Short-wavelength light at certain times of day
For the last metric we will look only at the short-wavelength contribution (already calculated in the previous section), but restricted to certain times of day. Table 19 shows the first approach, which looks exclusively at a fixed window of local time. Figure 7 expands this view to all hours of the day with a binned approach. Lastly, Table 20 focuses on photoperiods.
dataVEET |>
filter_Time(start = "11:00:00", end = "14:00:00") |> #filter out certain times
select(-c(Spectrum, long, `sl ratio`, Time.data, Datetime)) |>
summarize_numeric(prefix = "") |>
mean_daily(prefix = "") |>
gt() |> fmt_number(short) |> cols_label(short = "Short wavelength irradiance")

| Day | Short wavelength irradiance | episodes |
|---|---|---|
| Mean daily | −184.21 | 37 |
| Weekday | −202.23 | 37 |
| Weekend | −139.14 | 37 |
#creating the data
dataVEETtime <-
dataVEET |>
cut_Datetime(unit = "1 hour", #create time sections of one hour
type = "floor",
group_by = TRUE) |>
select(-c(Spectrum, long, `sl ratio`, Datetime)) |>
summarize_numeric(prefix = "") |>
group_by(Datetime.rounded, .drop = FALSE) |> #group by the time state
mean_daily(prefix = "", sub.zero = TRUE) |>
create_Timedata(Datetime.rounded) #add a time column for plotting
#creating the plot
dataVEETtime |>
ggplot(aes(x=Time.data, y = short/max(short))) +
geom_col(aes(fill = Day), position = "dodge") +
ggsci::scale_fill_jco() +
theme_minimal() +
labs(y = "Relative short wavelength contribution (%)",
x = "Local time (HH:MM)") +
scale_y_continuous(labels = scales::label_percent()) +
scale_x_time(labels = scales::label_time(format = "%H:%M"))

dataVEET |>
select(-c(Spectrum, long, `sl ratio`)) |>
add_photoperiod(coordinates) |>
group_by(photoperiod.state, .add = TRUE) |>
summarize_numeric(prefix = "",
remove = c("dawn", "dusk", "photoperiod", "Datetime")) |>
group_by(photoperiod.state) |>
mean_daily(prefix = "") |>
select(-episodes) |>
pivot_wider(names_from = photoperiod.state, values_from = short) |>
gt() |> fmt_number()

| Day | day | night |
|---|---|---|
| Mean daily | 80.14 | −27.73 |
| Weekday | 140.09 | −33.96 |
| Weekend | −69.73 | −12.17 |
5 Discussion and conclusion
This tutorial demonstrated how to derive various metrics used in current and future research in a principled and standardized way. While not brief overall, each metric has a dedicated, easy-to-understand pipeline that produces its summaries. These pipelines utilize LightLogR's framework and combine it with common data analysis workflows. The goal is to make the process transparent (function definitions open source), accessible (sound documentation, tutorials, speaking function and argument names, MIT license), robust (over 400 unit tests for functions, continuous integration on GitHub, bug tracking on GitHub), and community-driven (feature tracking on GitHub, open process for researchers who want to contribute code or suggest features).
The tutorial also demonstrated that even with these standardized pipelines, there are many decisions a researcher has to make (and document) to clean data, deal with measurement epochs, and derive the metrics, especially where clusters of data are concerned.
The range of features for exploring the data and the extracted metrics or clusters in plots and tables, and for handling measurement intervals, gaps, and irregular data, makes LightLogR an excellent choice for the field of visual experience research, be it in circadian, myopia, or related areas.
6 Session info
sessionInfo()

R version 4.4.3 (2025-02-28)
Platform: aarch64-apple-darwin20
Running under: macOS Sequoia 15.5
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: Europe/Berlin
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] gt_1.0.0 LightLogR_0.9.0 lubridate_1.9.4 forcats_1.0.0
[5] stringr_1.5.1 dplyr_1.1.4 purrr_1.0.4 readr_2.1.5
[9] tidyr_1.3.1 tibble_3.2.1 ggplot2_3.5.2 tidyverse_2.0.0
loaded via a namespace (and not attached):
[1] gtable_0.3.6 xfun_0.52 htmlwidgets_1.6.4 tzdb_0.5.0
[5] vctrs_0.6.5 tools_4.4.3 generics_0.1.3 parallel_4.4.3
[9] proxy_0.4-27 pkgconfig_2.0.3 KernSmooth_2.23-26 RColorBrewer_1.1-3
[13] lifecycle_1.0.4 compiler_4.4.3 farver_2.1.2 suntools_1.0.1
[17] ggsci_3.2.0 janitor_2.2.1 snakecase_0.11.1 litedown_0.7
[21] class_7.3-23 htmltools_0.5.8.1 sass_0.4.10 yaml_2.3.10
[25] pillar_1.10.2 crayon_1.5.3 classInt_0.4-11 commonmark_1.9.5
[29] tidyselect_1.2.1 digest_0.6.37 stringi_1.8.7 sf_1.0-20
[33] labeling_0.4.3 cowplot_1.1.3 fastmap_1.2.0 grid_4.4.3
[37] archive_1.1.12 cli_3.6.5 magrittr_2.0.3 base64enc_0.1-3
[41] utf8_1.2.4 e1071_1.7-16 withr_3.0.2 scales_1.4.0
[45] bit64_4.6.0-1 timechange_0.3.0 rmarkdown_2.29 bit_4.6.0
[49] ggtext_0.1.2 hms_1.1.3 evaluate_1.0.3 knitr_1.50
[53] viridisLite_0.4.2 markdown_2.0 rlang_1.1.6 gridtext_0.1.5
[57] Rcpp_1.0.14 glue_1.8.0 DBI_1.2.3 xml2_1.3.8
[61] rstudioapi_0.17.1 vroom_1.6.5 jsonlite_2.0.0 R6_2.6.1
[65] units_0.8-7
7 References
Footnotes
1. Functions from LightLogR are presented as links to the function documentation. General analysis functions (from the package dplyr) are presented as normal text.↩︎

2. This deviates from the common definition of luminous exposure, which is the sum of illuminance measurements scaled to hourly observation intervals.↩︎